Abstract: In this paper, an information theoretic analysis of non-adaptive group testing schemes based on sparse pooling graphs is presented. The binary status of the objects to be tested is modeled by i.i.d. Bernoulli random variables with probability $p$. An $(l, r, n)$-regular pooling graph is a bipartite graph with left node degree $l$ and right node degree $r$, where $n$ is the number of left nodes. Two scenarios are considered: the noiseless setting and the noisy setting. The main contributions of this paper are direct part theorems which give conditions for the existence of an estimator achieving arbitrarily small estimation error probability. The direct part theorems are proved by averaging an upper bound on the estimation error probability of the typical set estimator over an $(l, r, n)$-regular pooling graph ensemble. The numerical results obtained here indicate sharp threshold behaviors in the asymptotic regime.

I. Introduction 

The paper by Dorfman [8] introduced the idea of group testing and presented a simple analysis indicating its advantages. His main motivation was to devise an economical way to detect infected persons in a population by using blood tests. It is assumed that the outcome of a blood test tells whether or not the blood used in the test contains certain target viruses (or bacteria).

Of course, testing the blood of each person in the population individually clearly distinguishes the infected individuals from those who are not infected. Dorfman's idea for reducing the number of tests is the following. We first divide the population into several disjoint groups and then make pools in such a way that a pool is a mixture of the blood of the persons in a group. The whole test process consists of two stages. In the first stage, a blood test is carried out for each pool, and the pools containing infected blood are detected. In the second stage, all the individuals in the groups giving positive results are tested. Numerical examples showing reductions in the number of tests without losing detection capability are given in [8].
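The saving offered by the two-stage procedure can be illustrated with a short calculation. The following Python sketch assumes, as in the model used later in this paper, that each person is infected independently with probability $p$; under that assumption a group of size $g$ costs one pooled test plus $g$ individual tests whenever the pool is positive. The function names are ours and are not taken from [8].

```python
def dorfman_tests_per_person(p, g):
    """Expected number of tests per person for the two-stage scheme.

    Assumes independent infections with probability p and a group size g >= 2:
    one pooled test per group, plus g individual tests if the pool is positive.
    """
    if g < 2:
        raise ValueError("group size must be at least 2")
    prob_pool_positive = 1.0 - (1.0 - p) ** g
    return 1.0 / g + prob_pool_positive


def best_group_size(p, g_max=100):
    """Group size in [2, g_max] minimizing the expected tests per person."""
    return min(range(2, g_max + 1),
               key=lambda g: dorfman_tests_per_person(p, g))
```

A value below 1 returned by `dorfman_tests_per_person` means that pooling needs fewer tests on average than testing every person individually.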

Dorfman's invention triggered subsequent theoretical work on group testing and a variety of practical applications, such as DNA clone library screening and detection of faulty parts of machines [9], [10]. Recent advances in the theory of compressed sensing [6] have stimulated research on theoretical aspects of group testing as well.

The group testing scheme due to Dorfman belongs to the class called adaptive group testing, in which some of the tests are designed based on the results of preceding tests. Another class is called non-adaptive group testing, in which the pool design is completely fixed before the tests are executed. Intuitively, adaptive group testing is natural and has an advantage over non-adaptive group testing because it requires fewer tests. However, non-adaptive group testing provides its own benefit: all the tests can be executed in parallel. Note that adaptive group testing requires sequential tests, which restricts parallel execution of the tests.

In order to develop a non-adaptive group testing scheme, pool design is crucial for achieving reasonably good detection performance. In the field of combinatorial group testing, a pooling matrix, which defines the set of pools to be tested, is constructed by using tools and theories from combinatorial design and combinatorics. Deterministic constructions of $K$-disjunct matrices are one of the central themes in combinatorial group testing [9], [10]. For a $K$-disjunct matrix, there are simple and efficient reconstruction algorithms that realize correct estimation for $K$-sparse instances.

Another class of constructions for generating a pooling matrix is random construction; i.e., the $(0,1)$-elements of a pooling matrix are determined probabilistically. Several reconstruction algorithms have been proposed for such probabilistically constructed pooling matrices. For example, Sejdinovic and Johnson [18] and Kanamori et al. [13] recently proposed belief propagation-based reconstruction algorithms. Malioutov and Malyutov [16] and Chan et al. [5] studied linear programming (LP)-based reconstruction.

From a theoretical point of view, clarifying the scaling behavior of the number of tests required for correct reconstruction has been one of the most important topics in this field. Berger and Levenshtein [2] studied a two-stage group testing scheme and unveiled the scaling law for the number of required tests based on an information theoretic framework. Mezard and Toninelli [17] provided a novel analysis of two-stage schemes based on theoretical techniques from statistical mechanics. Recently, Atia and Saligrama [1] presented an information theoretic analysis of non-adaptive group testing with and without noise. They showed a direct part theorem, which gives a condition for the existence of an estimator achieving arbitrarily small estimation error probability, and a converse part theorem, which gives a condition for the non-existence of good estimators. Their proofs of these theorems are based on the proof of channel coding theorems for multiple access channels and apply to both noiseless and noisy observations. For example, in the noiseless case, it is shown that a $K$-sparse instance of $n$ objects can be perfectly recovered from the test results if the number of tests is asymptotically $O(K \log n)$.

The main motivation of this work is to provide another information theoretic analysis of non-adaptive group testing based on sparse pooling graphs. In this paper, we assume that the status (0 or 1) of the $n$ objects (persons) is modeled by i.i.d. Bernoulli random variables with probability $p$. In other words, we consider the scenario where the sparsity parameter $K$ scales as $K \sim pn$ asymptotically. In most conventional analyses such as [1], $K$ is assumed to be independent of $n$. Such an assumption is reasonable for clarifying the dependency of the required number of tests on the sparsity parameter and the number of objects. Although our assumption is different from the conventional one, it is also natural in an information theoretic sense and suitable for observing sharp threshold behaviors in the asymptotic regime.

Another new ingredient of this work is that the analysis is carried out under the assumption of an $(l, r, n)$-regular pooling graph ensemble, which is a bipartite graph ensemble with left node degree $l$ and right node degree $r$, where $n$ is the number of left nodes. This model is suitable for handling a very sparse pooling matrix and is amenable to ensemble analysis. We will present direct and converse theorems which predict the asymptotic behavior of a group testing scheme with an $(l, r, n)$-pooling graph. The asymptotic conditions appearing in the direct and converse theorems are parameterized by $p$, $l$ and $r$. Therefore, for a given pair $(l, r)$, an achievable region for $p$, where arbitrarily accurate estimation is possible, can be clarified. Our analysis is inspired by the analysis of low-density parity-check (LDPC) codes due to Gallager and others

mil in ma. 

II. Preliminaries 

In this section, we introduce the two scenarios for group testing discussed in this paper. The first is the noiseless system, where the test results can be seen as a function of an input vector. The second is the noisy system, where the test results are disturbed by additive noise.

A. Problem setting for noiseless system 

The random vector $X = (X_1, \ldots, X_n)$ represents the status of the $n$ objects. We assume that the $X_i$ ($i \in [1, n]$) are i.i.d. Bernoulli random variables with the probability distribution $\Pr(X_i = 0) = 1 - p$, $\Pr(X_i = 1) = p$ ($0 < p < 1$). The notation $[a, b]$ represents the set of consecutive integers from $a$ to $b$. With a slight abuse of notation, $[a, b]$ is also used to represent the closed interval over $\mathbb{R}$ when there is no risk of confusion. A realization of $X$ is denoted by $x = (x_1, \ldots, x_n)$. The test function $\mathrm{OR}(z_1, \ldots, z_r) : \{0,1\}^r \to \{0,1\}$ is the logical OR (disjunction) function with $r$ arguments ($r$ is a positive integer) defined by

$$\mathrm{OR}(z_1, \ldots, z_r) \triangleq \begin{cases} 0, & z_1 = z_2 = \cdots = z_r = 0, \\ 1, & \text{otherwise}. \end{cases} \tag{1}$$

The results of the pooling tests, abbreviated as test results, are represented by $Y = (Y_1, \ldots, Y_m)$. A realization of $Y$ is denoted by $y = (y_1, \ldots, y_m)$.

Let $G = (V_L, V_R, E)$ be a bipartite graph, called a pooling graph, with the following properties. The $n$ nodes in $V_L$ are called left nodes and the $m$ nodes in $V_R$ are called right nodes. The set $E$ represents the set of edges. For convenience, we assume that the left nodes are labeled from 1 to $n$. The left node with label $i \in [1, n]$ (for simplicity, we call it left node $i$ hereafter) corresponds to $X_i$. In a similar manner, the right nodes are labeled from 1 to $m$. In this paper, $G$ is assumed to be an $(l, r, n)$-regular bipartite graph, which means that every left node has degree $l$, every right node has degree $r$, and the number of left nodes is $n$.

For the right node $j \in [1, m]$, the neighbor set of node $j$ is defined by $M(j) = \{ i \in [1, n] \mid (i, j) \in E \}$. We are now ready to describe the relationship between $X$ and $Y$. For a given pooling graph $G$, the $Y_j$ ($j \in [1, m]$) are related to the $X_i$ ($i \in [1, n]$) by $Y_j = \mathrm{OR}\big((X_i)_{i \in M(j)}\big)$. The notation $(X_i)_{i \in M(j)}$ represents $(X_{j_1}, \ldots, X_{j_r})$ when $M(j) = \{j_1, \ldots, j_r\}$. Namely, a pooling graph $G$ defines a function from $X$ to $Y$. We denote this relationship by $Y = F_G(X)$ for short.

The goal of an examiner is to infer the realization of the hidden random variable $X$ from the test observation $y$ as accurately as possible. Assume that the examiner uses an estimator (i.e., estimation function) $\Phi : \{0,1\}^m \to \{0,1\}^n$ for the inference. The estimator gives an estimate of $x$, $\hat{x} = \Phi(y)$, from the test observation $y$. The estimator $\Phi$ should be chosen to make the estimation error probability $P_e = \Pr(\Phi(F_G(X)) \neq X)$ as small as possible.

B. Problem setting for noisy system

The problem setting for the noisy system is almost the same as the setting for the noiseless system described in the previous subsection. The crucial difference between the noiseless system and the noisy system is the assumption on the observation noise. In the case of the noisy system, the examiner observes a realization of the random variable $Z$ defined by

$$Z = Y + E = F_G(X) + E, \tag{2}$$

where $E = (E_1, \ldots, E_m)$ represents the observation noise. We simply assume that the $E_i$ ($i \in [1, m]$) are i.i.d. Bernoulli random variables with the probability distribution $\Pr(E_i = 0) = 1 - q$, $\Pr(E_i = 1) = q$ ($0 < q < 1$).
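As an illustration of the two settings, the following Python sketch draws one pooling graph, evaluates the OR tests, and produces a noisy observation. It is a minimal sketch under assumptions that are ours rather than statements of the text above: the graph is drawn by shuffling edge sockets (so an object may occasionally appear twice in the same pool), the addition of the noise is taken modulo 2, objects and pools are indexed from 0, and all function names are hypothetical.

```python
import random


def sample_pooling_graph(l, r, n, rng):
    """Draw a bipartite graph with n left nodes of degree l and right degree r.

    Returns a list of m = l*n/r pools; pools[j] lists the left nodes attached
    to right node j. Multi-edges are possible in this socket-based sketch.
    """
    if (l * n) % r != 0:
        raise ValueError("l * n must be divisible by r")
    m = l * n // r
    sockets = [i for i in range(n) for _ in range(l)]  # l sockets per left node
    rng.shuffle(sockets)
    return [sockets[j * r:(j + 1) * r] for j in range(m)]


def pooled_tests(pools, x):
    """Noiseless test results: each entry is the OR of x over one pool."""
    return [int(any(x[i] for i in pool)) for pool in pools]


def noisy_observation(y, q, rng):
    """Flip each test result independently with probability q (mod-2 noise)."""
    return [yj ^ int(rng.random() < q) for yj in y]
```

For example, with `rng = random.Random(0)`, calling `sample_pooling_graph(3, 6, 20, rng)` returns 10 pools of 6 objects each, in which every object appears 3 times.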

III. Converse Part Analysis 

In this section, lower bounds on the estimation error probability for the noiseless and noisy systems are shown. The key to the proofs is Fano's inequality, which ties the estimation error probability to the conditional entropy.

A. Lower bound for noiseless system 

Fano's inequality relates the conditional entropy to the estimation error probability, and it has often been used as the main tool for the converse part of a channel coding theorem [7]. This inequality also plays a crucial role in the following analysis, which clarifies the limit of accurate estimation for the noiseless and noisy systems.

Lemma 1 (Fano's inequality): Assume that random variables $A$ and $B$ are given. The cardinalities of the domains (alphabets) of $A$ and $B$ are assumed to be finite. For any estimator $\phi$ for estimating the hidden value of $A$ from the observation of $B$, the inequality $1 + \Pr(A \neq \phi(B)) \log_2 |\mathcal{A}| \ge H(A \mid B)$ holds, where the domain of $A$ is denoted by $\mathcal{A}$. □

We here use Fano's inequality to derive a lower bound on the error probability of estimation for the noiseless system. Note that this lower bound does not depend on the choice of a pooling graph or an estimator. The proof of the theorem resembles the proof of the upper bound on
code rate for LDPC codes ifTTI PI . Similar argument can be 
found in Gl, as well. 

Theorem 1 (Noiseless system): Assume the noiseless system. For any pair of an $(l, r, n)$-pooling graph and an estimator, the error probability $P_e$ is bounded from below by

$$h(p) - \frac{l}{r} h\big((1-p)^r\big) - \frac{1}{n} \le P_e. \tag{3}$$

(Proof) For any estimator having the error probability $P_e$, we have

$$\begin{aligned} H(X) &= I(X;Y) + H(X \mid Y) \\ &\le I(X;Y) + 1 + P_e \log_2 |\mathcal{X}| && (4) \\ &= H(Y) - H(Y \mid X) + 1 + P_e n && (5) \\ &= H(Y) + 1 + P_e n. && (6) \end{aligned}$$

The inequality (4) is due to Fano's inequality. The equation (5) holds since $\mathcal{X} = \{0,1\}^n$. Note that, in the noiseless system, the random variable $Y$ is a function of $X$, namely $Y = F_G(X)$, which implies $H(Y \mid X) = 0$. The last equality (6) is a consequence of $H(Y \mid X) = 0$.

Since we have assumed that $X = (X_1, \ldots, X_n)$ is an $n$-tuple of i.i.d. Bernoulli random variables, the entropy of $X$ is given by $H(X) = n h(p)$, where $h(p)$ is the binary entropy function defined by $h(p) = -p \log_2 p - (1-p) \log_2 (1-p)$. We thus have

$$n h(p) \le H(Y) + 1 + P_e n. \tag{7}$$

For further evaluation, we need to evaluate $H(Y) = H(Y_1, \ldots, Y_m)$. Note that the random variables $Y_1, Y_2, \ldots, Y_m$ are binary and that they are correlated in general. A simple upper bound on $H(Y)$ is $H(Y_1, Y_2, \ldots, Y_m) \le \sum_{j=1}^{m} H(Y_j)$. This follows from the chain rule and the fact that conditioning reduces entropy [7]. From our assumptions that $Y_j = \mathrm{OR}\big((X_i)_{i \in M(j)}\big)$ ($j \in [1, m]$) and that $|M(j)| = r$ ($j \in [1, m]$), we have $H(Y_j) = h((1-p)^r)$ because $\Pr[Y_j = 0] = (1-p)^r$. Combining these results, we obtain the inequality $n h(p) \le m h((1-p)^r) + 1 + P_e n$. Dividing both sides by $n$ and using $m = (l/r) n$, which follows from counting the edges ($nl = mr$), we obtain the claim of the theorem. □
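The bound of Theorem 1 is easy to evaluate numerically. The following Python sketch computes $h(p) - (l/r) h((1-p)^r) - 1/n$ for given parameters; the function names are ours, and the convention $h(0) = h(1) = 0$ is an assumption added to keep the code well defined at the endpoints.

```python
from math import log2


def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def theorem1_lower_bound(p, l, r, n):
    """Lower bound on P_e from Theorem 1 for an (l, r, n)-pooling graph."""
    return binary_entropy(p) - (l / r) * binary_entropy((1.0 - p) ** r) - 1.0 / n
```

A positive return value means that no estimator can drive the error probability below that value; a negative value makes the bound vacuous for the chosen parameters.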



B. Lower bound for noisy system 

Let us recall the problem setup for the noisy system. The random variable $Z = (Z_1, \ldots, Z_m)$ representing a noisy observation is defined by $Z = Y + E = F_G(X) + E$. As in the case of the noiseless system, a lower bound on the error probability for the noisy system can be derived from Fano's inequality.

Theorem 2 (Noisy system): Assume the noisy system. For any pair of an $(l, r, n)$-pooling graph and an estimator, the error probability $P_e$ is bounded from below by

$$h(p) + \frac{l}{r} h(q) - \frac{l}{r} h\big((1-p)^r (1-q) + (1 - (1-p)^r) q\big) - \frac{1}{n} \le P_e. \tag{8}$$
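No proof of Theorem 2 is given here. One possible derivation, sketched under the assumptions that the noise $E$ is independent of $X$ and that the addition in $Z = Y + E$ is taken modulo 2, follows the same steps as the proof of Theorem 1:

```latex
\begin{align*}
n h(p) = H(X) &= I(X;Z) + H(X \mid Z) \\
 &\le H(Z) - H(Z \mid X) + 1 + P_e n \\
 &= H(Z) - m\, h(q) + 1 + P_e n \\
 &\le m\, h\big((1-p)^r (1-q) + (1-(1-p)^r)\, q\big) - m\, h(q) + 1 + P_e n .
\end{align*}
```

The second line uses Fano's inequality, the third uses $H(Z \mid X) = H(E) = m h(q)$, and the last uses $H(Z) \le \sum_{j} H(Z_j)$ together with $\Pr[Z_j = 0] = \Pr[Y_j = 0](1-q) + \Pr[Y_j = 1]\, q$. Dividing by $n$ and using $m = (l/r) n$ gives a bound of the form stated in the theorem.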

IV. Direct Part Analysis 

In the previous section, we discussed the limitation on accurate estimation for any estimator, i.e., a lower bound on the error probability. This result is similar to the converse part of a coding theorem. In this section, we discuss a direct part, i.e., the existence of a sequence of estimators achieving arbitrarily small error probability. As in the case of coding theorems, we rely on the standard bin coding argument [7] to prove the main theorems. In order to apply such an information theoretic argument, we introduce a class of estimators called typical set estimators.

A. Pooling graph ensemble 

In the following analysis, we take the ensemble average of the error probability of the typical set estimator over an ensemble of pooling graphs. The pooling graph ensemble introduced below resembles the bipartite graph ensemble for regular LDPC codes. The following definition gives the details of the pooling graph ensemble [14].

Definition 1 (Pooling graph ensemble): Let $\mathcal{G}_{l,r,n}$ be the set of all $(l, r, n)$-regular bipartite graphs with $n$ left nodes and $m = (l/r) n$ right nodes. The cardinality of $\mathcal{G}_{l,r,n}$ is $(nl)!$. Assume that the equal probability $P(G) = 1/(nl)!$ is assigned to each graph $G \in \mathcal{G}_{l,r,n}$. The probability space defined by the pair $(\mathcal{G}_{l,r,n}, P)$ is referred to as the $(l, r, n)$-pooling graph ensemble. □

In order to prove the direct part theorems, we need to evaluate the expectation of the number of typical sequences $x$ satisfying $y = F_G(x)$ over the $(l, r, n)$-pooling graph ensemble. The next lemma is required for deriving the main theorems.

Lemma 2: Assume that $s \in [0, m]$ and $w \in [0, n]$ are given. Let $y_s \in \{0,1\}^m$ be a binary $m$-tuple with weight $s$ and $x_w \in \{0,1\}^n$ be a binary $n$-tuple with weight $w$. The probability of the event $y_s = F_G(x_w)$ is given by

$$E\big[I[y_s = F_G(x_w)]\big] = \frac{1}{\binom{nl}{wl}} \mathrm{Coeff}\big[((1+z)^r - 1)^s, z^{lw}\big], \tag{9}$$

where $\mathrm{Coeff}[g(z), z^i]$ represents the coefficient of $z^i$ in the polynomial $g(z)$. The function $I[\mathrm{cond}]$ is the indicator function taking the value 1 if cond is true and the value 0 otherwise.
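The formula of Lemma 2 can be checked by exhaustive enumeration on a very small instance. The Python sketch below makes an assumption about the ensemble that is consistent with its cardinality $(nl)!$ but is not spelled out above: a graph is identified with a permutation that connects the $nl$ left edge sockets to the $nl = mr$ right edge sockets. The function names and the 0-based socket numbering are ours.

```python
from fractions import Fraction
from itertools import permutations
from math import comb


def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def coeff_or_poly(r, s, k):
    """Coefficient of z**k in ((1 + z)**r - 1)**s."""
    base = [0] + [comb(r, t) for t in range(1, r + 1)]  # (1 + z)^r - 1
    poly = [1]
    for _ in range(s):
        poly = poly_mul(poly, base)
    return poly[k] if k < len(poly) else 0


def lemma2_probability(n, l, r, w, s):
    """Right-hand side of the Lemma 2 formula as an exact fraction."""
    return Fraction(coeff_or_poly(r, s, l * w), comb(n * l, w * l))


def brute_force_probability(n, l, r, w, s):
    """Fraction of socket permutations for which y_s = F_G(x_w) holds."""
    m = n * l // r
    x = [1] * w + [0] * (n - w)   # one fixed input of weight w
    y = [1] * s + [0] * (m - s)   # one fixed output of weight s
    hits = total = 0
    for perm in permutations(range(n * l)):
        z = [0] * m
        for left_socket, right_socket in enumerate(perm):
            if x[left_socket // l]:          # socket belongs to a positive object
                z[right_socket // r] = 1     # its pool tests positive
        hits += (z == y)
        total += 1
    return Fraction(hits, total)
```

Comparing `lemma2_probability(3, 2, 3, w, s)` with `brute_force_probability(3, 2, 3, w, s)` for all admissible `w` and `s` enumerates only $6! = 720$ permutations per pair, so the check runs quickly.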



The combinatorial argument presented in the proof of Lemma 2 is closely related to the derivation of an average input-output weight distribution of LDPC codes over a regular bipartite graph ensemble due to Hsu and Anastasopoulos [12].

B. Analysis on error probability for noiseless system 

In this subsection, we define the typical set estimator for the noiseless system and give an error performance analysis. Before describing the typical set estimator, we introduce the definition of the typical set [7] as follows.

Definition 2 (Typical set): Assume that i.i.d. random variables $A_i$ ($i \in [1, n]$), a positive constant $\epsilon$ and a positive integer $n$ are given. The typical set $T_{n,\epsilon}$ is defined by

$$T_{n,\epsilon} \triangleq \big\{ (a_1, \ldots, a_n) \in \mathcal{A}^n \;\big|\; |H(A) - \kappa(a_1, \ldots, a_n)| < \epsilon \big\}, \tag{10}$$

where $\mathcal{A}$ is the finite alphabet of $A_i$ and $H(A) = H(A_i)$ holds for $i \in [1, n]$. The function $\kappa$ is defined by $\kappa(a_1, \ldots, a_n) = (-1/n) \log_2 \Pr(a_1, \ldots, a_n)$. □
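For the Bernoulli source considered in this paper, $\kappa$ depends on a sequence only through its Hamming weight, so the typical set is a union of constant-weight classes. The following Python sketch lists the weights that belong to the typical set; it is an illustration with function names of our own choosing, and it assumes $0 < p < 1$.

```python
from math import log2


def kappa_for_weight(w, n, p):
    """kappa(a_1, ..., a_n) for a Bernoulli(p) sequence of length n with w ones."""
    return -(w * log2(p) + (n - w) * log2(1.0 - p)) / n


def typical_weights(n, p, eps):
    """Hamming weights w whose sequences satisfy |h(p) - kappa| < eps."""
    hp = -p * log2(p) - (1.0 - p) * log2(1.0 - p)
    return [w for w in range(n + 1)
            if abs(hp - kappa_for_weight(w, n, p)) < eps]
```

The smallest and largest elements of the returned list are the minimum and maximum weights of typical sequences, which is how the typical set enters the error analysis later in this section.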

The typical set estimator defined below is almost the same as the typical set decoder assumed in the proofs of several coding theorems such as [15]. It is used because it leads to a simpler proof; in general, it is computationally infeasible to carry out. Despite its computational complexity, the performance of the typical set estimator can be used as a benchmark for other estimation algorithms.

Definition 3 (Typical set estimator): Assume the noiseless system. Suppose that an $(l, r, n)$-pooling graph $G \in \mathcal{G}_{l,r,n}$ and a positive real value $\epsilon$ are given. The typical set estimator $\Phi : \{0,1\}^m \to \{0,1\}^n \cup \{E\}$ is defined by

$$\Phi(y) \triangleq \begin{cases} \hat{x} \in D(y), & \text{if } |D(y)| = 1, \\ E, & \text{otherwise}, \end{cases} \tag{11}$$

where $D(y)$ ($y \in \{0,1\}^m$) is the decision set defined by $D(y) \triangleq \{ x \in T_{n,\epsilon} \mid y = F_G(x) \}$. The symbol $E$ represents failure of estimation. □
The typical set estimator $\Phi$ depends on the bins defined on the typical set $T_{n,\epsilon}$. A bin $D(y)$ consists of the inverse image of $y$ in the typical set. For an observed vector $y$, if the cardinality of the bin $D(y)$ is 1, the estimator declares that the unique $x \in D(y)$ has occurred. A failure of estimation is declared when the cardinality of $D(y)$ is not 1; for a typical $x$ this happens when the cardinality of $D(y)$ is greater than 1. For evaluating the error probability of the typical set estimator, an analysis of this event is indispensable, and it is the main topic of the following analysis.
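For very small $n$, the typical set estimator can be carried out by exhaustive search, which may help to make the definition concrete. The following Python sketch is only an illustration: it enumerates all $2^n$ binary vectors, represents the pooling graph as a list of pools with 0-based object indices, and returns `None` in place of the failure symbol. These conventions and the function names are ours.

```python
from itertools import product
from math import log2


def is_typical(x, p, eps):
    """Check |h(p) - kappa(x)| < eps for a Bernoulli(p) source, 0 < p < 1."""
    n, w = len(x), sum(x)
    hp = -p * log2(p) - (1.0 - p) * log2(1.0 - p)
    kappa = -(w * log2(p) + (n - w) * log2(1.0 - p)) / n
    return abs(hp - kappa) < eps


def pooled_tests(pools, x):
    """Noiseless test results: OR of x over each pool."""
    return tuple(int(any(x[i] for i in pool)) for pool in pools)


def typical_set_estimator(pools, n, p, eps, y):
    """Return the unique typical x consistent with y, or None on failure."""
    decision_set = [x for x in product((0, 1), repeat=n)
                    if is_typical(x, p, eps) and pooled_tests(pools, x) == tuple(y)]
    return decision_set[0] if len(decision_set) == 1 else None
```

For instance, with the ring-shaped pools `[(0, 1), (1, 2), (2, 3), (3, 0)]`, `n = 4`, `p = 0.25` and `eps = 0.1`, the typical set consists of the weight-1 vectors, and each of them produces a distinct test result, so every typical input is recovered.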

The next lemma establishes the existence of a pair $(G, \Phi)$ achieving a given upper bound on the error probability, which can be regarded as a counterpart of the direct part of a coding theorem.

Lemma 3: Assume the noiseless system. For any given $\gamma > 0$, if

$$-(l-1) h(p) + \max_{\sigma \in [0, l/r]} \log_2 \inf_{z > 0} \frac{((1+z)^r - 1)^{\sigma}}{z^{lp}} + \gamma < 0 \tag{12}$$

holds, then there exists a pair $(G \in \mathcal{G}_{l,r,n}, \Phi)$ with error probability smaller than $\gamma$. □

(Proof) The proof is based on the bin coding argument. Assume that a positive real number $\epsilon$ is given (later, we will see that $\epsilon$ is determined according to $\gamma$ but, for the moment, we regard $\epsilon$ as a given parameter). Note that there are two events in which the typical set estimator fails to estimate correctly. The event in which a realization $x$ of $X$ is not a typical sequence is denoted by Event I. The other event, Event II, corresponds to the case where a realization $x$ is a typical sequence but $|D(F_G(x))| > 1$ holds.

We therefore have

$$P_e = \Pr[X \neq \Phi(F_G(X))] = P_I + P_{II}(G), \tag{13}$$

where $P_I$ and $P_{II}(G)$ are the probabilities corresponding to Event I and Event II, respectively. Note that the probability $P_I$ depends only on the parameters $n$ and $\epsilon$.

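We first consider the probability $P_{II}(G)$, which can be upper bounded as follows:

$$\begin{aligned} P_{II}(G) &= \sum_{x \in T_{n,\epsilon}} \Pr(x)\, I\big[\exists x' \in T_{n,\epsilon},\ x' \in D(F_G(x)),\ x' \neq x\big] \\ &\le \sum_{x \in T_{n,\epsilon}} \Pr(x) \sum_{x' \in T_{n,\epsilon},\, x \neq x'} I\big[F_G(x) = F_G(x')\big]. \end{aligned} \tag{14}$$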

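By taking the expectation of (14) over the $(l, r, n)$-pooling graph ensemble, we obtain

$$\begin{aligned} E[P_{II}(G)] &\le \sum_{x \in T_{n,\epsilon}} \Pr(x) \sum_{x' \in T_{n,\epsilon},\, x \neq x'} E\big[I[F_G(x) = F_G(x')]\big] \\ &\le |T_{n,\epsilon}| \max_{w_{\min} \le w \le w_{\max}} \max_{0 \le s \le m} E\big[I[y_s = F_G(x_w)]\big], \end{aligned} \tag{15}$$

where $w_{\min}$ and $w_{\max}$ are defined by

$$w_{\max} = \max_{x \in T_{n,\epsilon}} \mathrm{wt}(x), \qquad w_{\min} = \min_{x \in T_{n,\epsilon}} \mathrm{wt}(x), \tag{16}$$

where $\mathrm{wt}(x)$ represents the Hamming weight of $x$. The vector $y_s$ is an arbitrary binary $m$-tuple with weight $s$ and $x_w$ is an arbitrary binary $n$-tuple with weight $w$.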

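Applying the upper bound on the size of the typical set and Lemma 2 to (15), we have

$$E[P_{II}(G)] \le 2^{n(h(p) + \epsilon)} \max_{w_{\min} \le w \le w_{\max}} \max_{0 \le s \le m} T(n, l, r, w, s), \tag{17}$$

where $T(n, l, r, w, s) = \frac{1}{\binom{nl}{wl}} \mathrm{Coeff}\big[((1+z)^r - 1)^s, z^{lw}\big]$. By letting $\omega = w/n$ and $\sigma = s/n$, the above inequality (17) can be rewritten as $E[P_{II}(G)] \le 2^{n(h(p) + \epsilon + Q)}$, where

$$Q \triangleq \frac{1}{n} \log_2 \left[ \max_{w_{\min} \le w \le w_{\max}} \max_{0 \le s \le m} T(n, l, r, w, s) \right].$$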

For evaluating the coefficient of the generating function, the theorem by Burshtein and Miller [3] can be exploited, and $Q$ can be expressed as

$$Q = -l h(p) + \max_{\sigma \in [0, l/r]} \log_2 \inf_{z > 0} \frac{((1+z)^r - 1)^{\sigma}}{z^{lp}} + \xi(\epsilon) + \delta(n), \tag{18}$$

where $\xi(\epsilon)$ is a function of $\epsilon$ such that $\xi(\epsilon) \to 0$ as $\epsilon \to 0$, and $\delta(n)$ is a function such that $\delta(n) \to 0$ as $n \to \infty$.
Assume that a positive real number $\gamma$ is given and that

$$-(l-1) h(p) + \max_{\sigma \in [0, l/r]} \log_2 \inf_{z > 0} \frac{((1+z)^r - 1)^{\sigma}}{z^{lp}} + \gamma < 0 \tag{19}$$

holds. For sufficiently large $n$ and sufficiently small $\epsilon$, there exists a pair $(n, \epsilon)$ satisfying $\epsilon + \delta(n) + \xi(\epsilon) < \gamma$ and the following two conditions. The first condition is that $E[P_{II}(G)] < \gamma/2$. Note that, due to the assumption (19), the exponential growth rate of the upper bound on $E[P_{II}(G)]$ is negative, and thus the upper bound on $E[P_{II}(G)]$ can be made arbitrarily small as $n \to \infty$. The second condition is that $P_I < \gamma/2$, which is guaranteed by the asymptotic equipartition property (AEP) for the typical set [7]. As a result, we have $E[P_e] = P_I + E[P_{II}(G)] < \gamma$, and this implies the existence of a pair $(G \in \mathcal{G}_{l,r,n}, \Phi)$ with error probability smaller than $\gamma$. □

From this lemma, we can immediately derive the following direct part result.

Theorem 3 (Achievability for noiseless system): Assume the noiseless system. For any given $\gamma > 0$, if

$$-(l-1) h(p) - l p \log_2 (2^{1/r} - 1) + \gamma < 0$$

holds, then there exists a pair $(G \in \mathcal{G}_{l,r,n}, \Phi)$ with error probability smaller than $\gamma$. □
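The step from Lemma 3 to Theorem 3 is not spelled out in the text. One way to see it, offered here as a sketch rather than as the authors' own argument, is to bound the infimum over $z$ by the value at the particular point $z^* = 2^{1/r} - 1$ (our notation), for which $(1 + z^*)^r - 1 = 1$ regardless of $\sigma$:

```latex
\max_{\sigma \in [0,\, l/r]} \log_2 \inf_{z > 0}
  \frac{((1+z)^r - 1)^{\sigma}}{z^{lp}}
\;\le\;
\max_{\sigma \in [0,\, l/r]} \log_2
  \frac{((1+z^*)^r - 1)^{\sigma}}{(z^*)^{lp}}
\;=\; -\, l p \log_2 \big( 2^{1/r} - 1 \big).
```

Hence, whenever the condition of Theorem 3 holds, the condition of Lemma 3 holds as well.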
From Theorem 1 and Theorem 3, it is natural to conjecture the existence of a threshold value $p^*(l, r)$ partitioning the range of $p$ into two regions. Namely, if $p < p^*(l, r)$, arbitrarily accurate estimation is possible. Otherwise, i.e., if $p > p^*(l, r)$, no estimator achieving arbitrarily small error probability exists in the asymptotic limit $n \to \infty$. An upper bound on the threshold can be obtained from Theorem 1. The upper bound $p_U^*(l, r)$ is given by $p_U^*(l, r) = \inf \{ p \mid p \text{ satisfies } h(p) - (l/r) h((1-p)^r) > 0 \}$. On the other hand, a lower bound on the threshold is defined by $p_L^*(l, r) = \sup \{ p \mid p \text{ satisfies } -(l-1) h(p) - l p \log_2 (2^{1/r} - 1) < 0 \}$, which is a direct consequence of Theorem 3. Table I presents the values of the lower and upper bounds on the threshold for the two cases $l/r = 1/2$ and $l/r = 1/4$.



TABLE I
Threshold bounds for noiseless systems

(l, r)      p_L^*(l, r)    p_U^*(l, r)
(2, 4)      0.092763       0.097350
(3, 6)      0.110022       0.110023
(4, 8)      0.104629       0.105999
(5, 10)     0.096091       0.099480
(6, 12)     0.087848       0.093027
(2, 8)      0.022022       0.026824
(3, 12)     0.038651       0.039535
(4, 16)     0.041685       0.041687
(5, 20)     0.040693       0.040978
(6, 24)     0.038556       0.039427


C. Analysis on error probability for noisy system 

As in the case of the noiseless system, we can derive a 
direct part theorem for the noisy system shown below. 

Theorem 4 (Achievability for noisy system): Assume the noisy system. For any given $\gamma > 0$, if $-(l-1) h(p) + \frac{l}{r} h(q) - l p \log_2 (2^{1/r} - 1) + \gamma < 0$ holds, then there exists a pair $(G \in \mathcal{G}_{l,r,n}, \Phi)$ with error probability smaller than $\gamma$. □
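The conditions for the noisy system can be evaluated in the same way as those for the noiseless system. The following Python sketch computes the left-hand side of the converse bound for the noisy system without the $1/n$ term, and the left-hand side of the condition of Theorem 4 without $\gamma$. The function names are ours, and the convention $h(0) = h(1) = 0$ is an added assumption.

```python
from math import log2


def h(p):
    """Binary entropy in bits, with h(0) = h(1) = 0 by convention."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1.0 - p) * log2(1.0 - p)


def noisy_converse_margin(p, q, l, r):
    """h(p) + (l/r) h(q) - (l/r) h(Pr[Z_j = 0]) for flip probability q."""
    prob_y_zero = (1.0 - p) ** r
    prob_z_zero = prob_y_zero * (1.0 - q) + (1.0 - prob_y_zero) * q
    return h(p) + (l / r) * h(q) - (l / r) * h(prob_z_zero)


def noisy_direct_margin(p, q, l, r):
    """-(l-1) h(p) + (l/r) h(q) - l p log2(2^(1/r) - 1)."""
    return (-(l - 1) * h(p) + (l / r) * h(q)
            - l * p * log2(2.0 ** (1.0 / r) - 1.0))
```

A negative value of `noisy_direct_margin` leaves room for a positive $\gamma$ in Theorem 4, while a positive value of `noisy_converse_margin` means that the error probability stays bounded away from zero for large $n$. Setting `q = 0` recovers the corresponding noiseless expressions.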

V. Conclusion 

There is a strong similarity between group testing schemes and linear error correction schemes for binary symmetric channels. The analysis presented in this paper is inspired by
the theoretical works on LDPC codes [T5] J4] . From numerical 
evaluation, it is shown that the gap between the upper bound $p_U^*(l, r)$ and the lower bound $p_L^*(l, r)$ is usually quite small. This suggests the existence of a sharp threshold, similar to the Shannon limit for a channel coding problem.